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Abstract 

Classical penalty methods solve a sequence of unconstrained problems that put greater and greater 
stress on meeting the constraints. In the limit as the penalty constant tends to oo, one recovers the 
constrained solution. In the exact penalty method, squared penalties are replaced by absolute value 
penalties, and the solution is recovered for a finite value of the penalty constant. In practice, the kinks 
in the penalty and the unknown magnitude of the penalty constant prevent wide application of the exact 
penalty method in nonlinear programming. In this article, we examine a strategy of path following con- 
sistent with the exact penalty method. Instead of performing optimization at a single penalty constant, 
we trace the solution as a continuous function of the penalty constant. Thus, path following starts at the 
unconstrained solution and follows the solution path as the penalty constant increases. In the process, 
the solution path hits, slides along, and exits from the various constraints. For quadratic programming, 
the solution path is piecewise linear and takes large jumps from constraint to constraint. For a gen- 
eral convex program, the solution path is piecewise smooth, and path following operates by numerically 
solving an ordinary differential equation segment by segment. Our diverse applications to a) projection 
onto a convex set, b) nonnegative least squares, c) quadratically constrained quadratic programming, 
d) geometric programming, and e) semidefinite programming illustrate the mechanics and potential of 
path following. The final detour to image denoising demonstrates the relevance of path following to 
regularized estimation in inverse problems. In regularized estimation, one follows the solution path as 
the penalty constant decreases from a large value. 

Keywords: constrained convex optimization, exact penalty, geometric programming, ordinary differ- 
ential equation, quadratically constrained quadratic programming, regularization, semidefinite program- 
ming 
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19671 ). The general idea is to replace hard constraints by penalties or barriers and then exploit the well-oiled 



machinery for solving unconstrained problems. Penalty methods operate on the exterior of the feasible region 
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and barrier methods on the interior. The strength of a penalty or barrier is determined by a tuning constant. 
In classical penalty methods, a single global tuning constant is gradually sent to oo; in barrier methods, it 
is gradually sent to 0. Either strategy generates a sequence of solutions that converges in practice to the 
solution of the original constrained optimization problem. 

Barrier methods are now generally conceded to offer a better approach to solving convex programs than 
penalty methods. Application of log barriers and carefully controlled versions of Newton's method make it 



possi ble to follow the central path reliably and quickly to the constrained minimum (jBovd and Vand enberghe 



2004 ). Nonetheless, penalty meth ods should not be ru l ed ou t. Augmented Lagrangian methods 



Hestenes 



19751 ) and exact penalty methods (jNocedal and Wright 



20061) are potentially competitive with interior point 



methods for smooth convex programming problems. Both methods have the advantage that the solution 
of the constrained problem kicks in for a finite value of the penalty constant. This avoids problems of ill 
conditioning as the penalty constant tends to oo. 

The disadvantage of exact penalties over traditional quadratic penalties is lack of differentiability of 
the penalized objective function. In the current paper, we argue that this impediment can be finessed by 
path following. Our path following method starts at the unconstrained solution and follows the solution 
path as the penalty constant increases. In the process, the solution path hits, exits, and slides along the 
various constraint boundaries. The path itself is piecewise smooth with kinks at the boundary hitting and 
escape times. One advances along the path by numerically solving a differential equation for the Lagrange 
multipliers of the penalized problem. In the special case of quadratic programming with affine constraints , 



the so lution path is piecewise linear, and one can easily anticipate entire path segments (Zhou and 



2011bl ). This special case is intimately related to the linear complementarity problem (jCottle et al 
in optimization theory. 

Homotopy (continuation) methods for the solution of nonlinear e quations and optimization 



Lange 



1992) 



have been pursued for many years and enjoyed a variety of successes (jNocedal and Wright 



1986 



0001 



Zangwill and Garcia . 



2006 



problems 



Watson 



19811 ). To our knowledge, however, there has been no exploration of path 



following as an implementation of the exact penalty method. Our modest goal here is to assess the feasibility 
and versatility of exact path following for constrained optimization. Comparing its performance to existing 
methods, particularly the interior point method, is probably best left for later, more practically oriented 
papers. In our experience, coding the algorithm is straightforward in Matlab. The rich numerical resources 
of Matlab include differential equation solvers that alert the user when certain events such as constraint 
hitting and escape occur. 

The rest of the paper is organized as follows. Section [5] briefly reviews the exact penalty method for 
optimization and investigates sufficient conditions for uniqueness and continuity of the solution path. Section 



2 



[3] derives the path following strategy for general convex programs, with particular attention to the special 
cases of quadratic programming and convex optimization with affine constraints. Section U presents various 
applications of the path algorithm. Our most elaborate example demonstrates the relevance of path following 
to regularized estimation. The particular prob lem treated, image d enoising, is typical of many inverse 



problems in applied mathematics and statistics 



Zhou and Wul ([20111 ). In such problems one follows the 



solution path as the penalty constant decreases. Finally, Section [5] discusses the limitations of the path 
algorithm and hints at future generalizations. 

2 Exact Penalty Methods 

In this paper we consider the convex programming problem of minimizing the convex objective function 
f(x) subject to r affine equality constraints gi{x) — and s convex inequality constraints hj(x) < 0. We 
will further assume that f(x) and the hj(x) are twice diffcrentiablc. The differential df(x) is the row vector 
of partial derivatives of f(x); the gradient V f(x) is the transpose of df(x). The second differential d 2 f(x) 
is the Hessian matrix of second partial derivatives of f(x). Similar conventions hold for the differentials of 
the constraint functions. 



Exact penalty methods ([Nocedal and Wright 



2006 



Ruszczvhski 



20061) minimize the surrogate function 

r s 

£pO) = /O) + pY1 \9i( x )\ +P^ max {°'^( :E )}- (!) 

i=l j=l 

This definition of £ p (x) is meaningful regardless of whether the contributing functions are convex. If the 
program is convex, then £ p {x) is itself convex. It is interesting to compare £ p {x) to the Lagrangian function 

r s 

C(x) = f(x)+^2x i g i (x)+^2n j h j {x), 
i=i j=i 

which captures the behavior of f(x) near a constrained local minimum y. The Lagrangian satisfies the sta- 
tionarity condition V£(y) =0; its inequality multipliers /ij are nonnegative and satisfy the complementary 
slackness conditions fj,jhj(y) — 0. In an exact penalty method one takes 



p > max{|Ai| 
This choice creates the favorable circumstances 



l-M, mi 



(2) 



C(x) < £ P (x) for all cc 

£-{ z ) < f{ z ) — £p{ z ) for all feasible z 

£(y) = f(y) = £ P (y) for y optimal 

with profound consequences. As the next proposition proves, minimizing £ p (x) is effective in minimizing 
f(x) subject to the constraints. 



Proposition 2.1. Suppose the objective function f(x) and the constraint functions are twice differentiable 
and satisfy the Lagrange multiplier rule at the local minimum y. If inequality holds and v*d 2 C{y)v > 
for every vector v ^ satisfying dgi(y)v = and dhj(y)v < for all active inequality constraints, then 
y furnishes an unconstrained local minimum of £ p (x). For a convex program satisfying Slater's constraint 
qualification and inequality (0), y is a minimum of £ p {x) if and only if y is a minimum of f(x) subject to 
the constraints. No differentiability assumptions are required for convex programs. 

Proof. The conditions imposed on the quadratic form v*d 2 £(y)v are w e ll-kno wn sufficient conditions for 



a local minimum. Theorems 6.9 and 7.21 of the reference 



Ruszczvhskil (|2006l) prove all of the foregoing 



assertions. □ 
As previously stressed, the exact penalty method turns a constrained optimization prob lem into an uncon- 



strain ed minimization problem. Furthermore, in contrast to the quadratic penalty method (jNocedal and Wright 



2006, Section 17.1), the constrained solution in the exact method is achieved for a finite value of p. Despite 
these advantages, minimizing the surrogate function £ P {x) is complicated. For one thing, it is no longer 
globally differentiable. For another, one must minimize £ P {x) along an increasing sequence p n because the 
Lagrange multipliers ^ are usually unknown in advance. These hurdles have prevented wide application of 
exact penalty methods in convex programming. 

As a prelude to our derivation of the path following algorithm for convex programs, we record several 
properties of £ P (x) that mitigate the failure of differentiability. 

Proposition 2.2. The surrogate function £ p (x) is increasing in p. Furthermore, £ p (x) is strictly convex 
for one p > if and only if it is strictly convex for all p > 0. Likewise, it is coercive for one p > if and 
only if is coercive for all p > 0. Finally, if f(x) is strictly convex (or coercive), then all £ P {x) are strictly 
convex (or coercive). 

Proof. The first assertion is obvious. For the second assertion, consider more generally a finite family 
u\(x), . . . , u q (x) of convex functions, and suppose a linear combination 53|=i c kUk(x) with positive coeffi- 
cients is strictly convex. It suffices to prove that any other linear combination Ylk—i °kU k (x) with positive 
coefficients is strictly convex. For any two points x ^ y and any scalar a € (0, 1), we have 

u k [ax + (1 - a)y] < au k (x) + (1 - a)u k (y). (3) 

Since J2 k =i c k u k(x) * s strictly convex, strict inequality must hold for at least one k. Hence, multiplying 
inequality ([3]) by b k and adding gives 

9 9 9 

^2 hu k [ax + (1 - a)y] < b k u k (x) + (1 - a) b k u k (y). 

k=l k=l k=l 



The third assertion follo ws from the fact that a convex function is coercive if and only if its restriction to 



each half-line is coercive ([Bertsekas , 



2003, Proposition 3.2.2). Given this result, suppose £ P (x) is coercive, 
but £p*(x) is not coercive. Then there exists a point x, a direction v, and a sequence of scalars t n tending 
to oo such that £ p *(x + t n v) is bounded above. This requires the sequence f(x + t n v) and each of the 
sequences \g%{x + t n v)\ and max{0, hj(x + t n v) to remain bounded above. But in this circumstance the 
sequence £ p (x + t n v) also remains bounded above. The final two assertions are obvious. □ 

3 The Path Following Algorithm 

In this section, wc take a different point of view. Instead of minimizing £ p {x) for an increasing sequence p n , 
we study how the solution x{p) changes continuously with p and devise a path following strategy starting 
from p — 0. For some finite value of p, the path locks in on the solution of the original convex program. 
In regularized statistical estimation and inverse problems, the primary goal is to select relevant predictors 



single 


point along it ( 


Efron et al.. 


2004: 


2011b; 


Zhou and Wu 


2011 


). Although < 



Osborne et al 



2000: 



Tibshirani and Taylor 



2011 



Zhou and Lange 



20111 ). Although our theory will focus on constrained estimation, readers should bear 
in mind this second application area of path following. 

The path algorithm relies critically on the first order optimality condition that characterizes the optimum 
point of the convex function £ p (y). 

Proposition 3.1. For a convex program, a point x = x(p) minimizes the function £ p (y) if and only if x 
satisfies the stationarity condition 



= Vf{x) + p^s i Vg i {x)+p^t :j Vh :j {x) 

»=1 3=1 



for coefficient sets {sj}£ =1 and {tj} a j—\- These sets can be characterized as 



{-1} 9l {x) < 

*> z - { [-1, 1] 9i(x) = 
{1} gi(x) > 



{0} h 3 (x)<0 
and I, ■ { [0, 1] hj(x) = 
{1} hj(x)>0 



At most one point achieves the minimum of £ p (y) for a given p when £ P (y) is strictly convex. 



(4) 



(5) 



Proof. According to Fermat's rule, x minimizes £ P (y) if and only if belongs to the subdifferential d£ p (x) 
of £ p (y). To derive the subdifferential displayed in equations Q and ([5]), one applies the addition and chain 
rules of the convex calculus. The sets defining the possible values of s, and tj are the subdifferentials of the 
functi ons |s| and t+ = ma xjt, 0}, respectively. For more details see Theorem 3.5 and ancillary material in the 



book ( Ruszczvhski 



20061 ). Finally, it is well known that strict convexity guarantees a unique minimum. □ 
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To speak coherently of solution paths, one must validate the existence, uniqueness, and continuity of 
the solution x(p) to the system of equations (|T|). Uniqueness follows from strict convexity as already noted. 
Existence and continuity are more subtle. 

Proposition 3.2. If £ p (y) is strictly convex and coercive, then the solution path x(p) of equation (Qp exists 
and is continuous in p. If the gradient vectors {Vg;(cc) : gi(x) = 0} U {Vhj(x) : hj(x) = 0} of the active 
constraints are linearly independent at x(p) for p > 0, then the coefficients Si(p) and tj(p) are unique and 
continuous near p as well. 

Proof. In accord with Proposition ^. 21 we assume that either f(x) is strictly convex and coercive or restrict 
our attention to the open interval (0,oo). Consider a subinterval [a, b] and fix a point x in the common 
domain of the functions £ p {y). The coercivity of £ a {y) and the inequalities 

£ a [x(p)} < £ P [x(p)} < £ p (x) < £ b (x) 

demonstrate that the solution vector x{p) is bounded over [a,b]. To prove continuity, suppose that it fails 
for a given p £ [a, 6], Then there exists an e > and a sequence p n tending to p such \\x(p n ) — x(p)\\2 > e for 
all n. Since x(p n ) is bounded, we can pass to a subsequence if necessary and assume that x(p n ) converges 
to some point y. Taking limits in the inequality £ Pn [x(p n )] < £ Pn (x) demonstrates that £ p (y) < £ P (x) for 
all x. Because x(p) is unique, we reach the contradictory conclusions \\y — x(p)\\2 > e and y = x(p). 

Verification of the second claim is deferred to permit further discussion of path following. The claim says 
that an active constraint {gi{x) — or hj(x) = 0) remains active until its coefficient hits an endpoint of its 
subdifferential. Because the solution path is, in fact, piecewise smooth, one can follow the coefficient path 
by numerically solving an ordinary differential equation (ODE). □ 

Our path following algorithm works segment-by-segment. Along the path we keep track of the following 
index sets 

7V E = {* : gi(x) < 0} M = {j : h 3 (x) < 0} 

Z E = {i : 9i (x) = 0} Zz = {j : hj(x) = 0} (6) 

P E = {i : 9i(x) > 0} Vi = {j : h s (x) > 0} 

determined by the signs of the constraint functions. For the sake of simplicity, assume that at the beginning 
of the current segment Sj does not equal —1 or 1 when i € and tj does not equal or 1 when j 6 2[. In 
other words, the coefficients of the active constraints occur on the interior of their subdiffcrentials. Let us 
show in this circumstance that the solution path can be extended in a smooth fashion. Our plan of attack is 
to reparameterize by the Lagrange multipliers for the active constraints. Thus, set Xi = psi for i g Ze and 

G 



ujj = ptj for j G Z\. The multipliers satisfy — p < Xi < p and < OJj < p. The stationarity condition now 
reads 

ieMs ie^E j'ePi 

+ ^ XiVgi(x) + >J LJ Vhj{x). 

iez E jeZj 

To this we concatenate the constraint equations = <7i(£c) for i € i?E and = hjix) for j £ Z]. 
For convenience now define 



dgz E (x) 

dh Zl (x) 



ieA/" B iev E jeVi 



Uz(x) 

In this notation the stationarity equation can be recast as 

= Wf(x)+pu z (x) + U z (x 



Under the assumption that the matrix U z (x) has full row rank, one can solve for the Lagrange multipliers 
in the form 



^z E 



pz{x)U* z {x)]- l U z {x) [V f{x) + pu z {x)] 



(7) 



Hence, the multipliers are unique. Continuity of the multipliers is a consequence of the continuity of the 
solution vector x(p) and all functions in sight on the right-hand side of equation ([7]). This observation 
completes the proof of Proposition 13.21 

Collectively the stationarity and active constraint equations can be written as the vector equation 



k(x, A, a;, p). To solve for a;, A and u) in terms of p, we apply the implicit function theorem (jLange 



M agnus and Neudecker 



2004: 



19991) . This requires calculating the differential of k(x, A,u>, p) with respect to the 
underlying dependent variables x, A, and u> and the independent variable p. Because the equality constraints 
are affine, a brief calculation gives 

d 2 f(x) + P E J ev 1 d 2 h 3 {x) + £ . £Zi u 3 d%(x) U z {x) 
U z {x) 





d Xi x iW k{x,X,u,p) 
d p k(x,\,u),p) 



The matrix d Xt \ tU ,k(x, A, u>, p) is non singular when its upper-left block is positive definite and its lower-left 



block has full row rank ([Lange , 



2010l Proposition 11.3.2). Given that it is nonsingular, the implicit function 



theorem applies, and we can in principle solve for x, A and u> in terms of p. More importantly, the implicit 
function theorem supplies the derivative 



d_ 

dp 



x 

^■Z E 
U>Zi 



= -d^x,uk(x,\,u:,p) d p k(x,\,u:,p), 
which is the key to path following. We summarize our findings in the next proposition. 



(8) 
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Proposition 3.3. Suppose the surrogate function £ p {y) is strictly convex and coercive. If at the point 
x(po) the matrix d x ,\, M k{x, A, (jJ, p) is nonsingular and the coefficient of each active constraints occurs on 
the interior of its subdifferential, then the solution path x(p) and Lagrange multipliers A(p) anduj(p) satisfy 
the differential equation fSJ^ in the vicinity ofx(po). 

In practice one traces the solution path along the current time segment until either an inactive constraint 
becomes active or the coefficient of an active constraint hits the boundary of its subdifferential. The earliest 
hitting time or escape time over all constraints determines the duration of the current segment. When the 
hitting time for an inactive constraint occurs first, we move the constraint to the appropriate active set 
or Z\ and keep the other constraints in place. Similarly, when the escape time for an active constraint occurs 
first, we move the constraint to the appropriate inactive set and keep the other constraints in place. In the 
second scenario, if Si hits the value —1, then we move i to Ae; If Si hits the value 1, then we move i to "Pe- 
Similar comments apply when a coefficient tj hits or 1. Once this move is executed, we commence path 
following along the new segment. Path following continues until for sufficiently large p, the sets Afe, ^Ei and 



V\ are exhausted, = 0, and the solution vector x(p) stabilizes. Our previous paper (jZhou and Lange 



2011bt > suggests remedies in the very rare situations where escape times coincide. 

Path following simplifies considerably in two special cases. Consider convex quadratic programming with 
objective function f(x) — ^x t Ax+b t x and equality constraints Vx = d and inequality constraints Wx < e, 



where A is positive semi-definite. The exact penalized objective function becomes 

t 

a-} Ax, 4- /)*.t; 4- nV \v l x -Al + n' 



w t j x-e j )+. 



i 

£ p (x) = -x t Ax + b t x + p22,\v\x- di\+ p2_,(i 

i=i j=i 

Since both the equality and inequality constraints are affine, their second derivatives vanish. Both U z and 
«l are constant on the current path segment, and the path x(p) satisfies 



d 

dp 



x 



U z J V 



(9) 



2011bh 



This implies that the solution path x(p) is piecewise linear. Our previous paper (jZhou and Lange . 
is devoted entirely to this special class of problems and highlights many statistical applications. 

On the next rung on the ladder of generality are convex programs with affine constraints. For the exact 
surrogate 

s t 

£ P ( X ) = f(x) + P^2\v t t x-d l \+p^2(w t j x-e 3 ) + , 
i=i j=i 

the matrix U z and vector are still constant along a path segment. The relevant differential equation 
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becomes 



d_ 

dp 



x 



d 2 f(x) U% 







(10) 



There are two approaches for computing the right-hand side of equation (flU)) . When A — d 2 f(x) is positive 
definite and B — U z has full row rank, the relevant inverse amounts to 



A B l 
B 



A~ l B t [BA~ l B t ]~ 1 BA~ l A^B^BA^B']' 1 

— \BA" 1 B t ]~ 1 



[BA^B^BA 1 

The numerical cost of computing the inverse scales as 0(n 3 )+0(|Z| 3 ). When d 2 f(x) is a constant, the inverse 
is computed once. Sequentially updating it for different active sets Z is then conveniently organized around 



the sweep operator of computational statistics ([Zhou and Lange , 



2011bf ). For a general convex function f(x), 



every time x changes, the inverse must be recomputed. This burden plus the cost of computing the entries 
of d 2 f(x) slow the path algorithm for general convex problems. 

In many applications f(x) is conve x but not necessarily s trictly convex. One can circumvent problems in 



inverting d 2 f(x) by reparameterizing (jNocedal and Wright 



20061 ) . For the sake of simplicity, suppose that 



all of the constraints are affine and that U z has full row rank. The set of points x satisfying the active 
constraints can be written as x = w + Yy, where to is a particular solution, y is free to vary, and the 
columns of Y S ]& nx ( n -\ z \) span the null space of U z and hence are orthogonal to the rows of U z- Under 
the null space reparameterization, 4^ 



Y-^-. Furthermore, 



Y t d 2 f(x)Y 



d 2 J(w + Yy) 



jeVi 



It follows that equation (|10l) becomes 

d 

dp' 
d r 
dp 



iY t d 2 f(x)Y]- 1 Y t uz 
-Y[Y t d 2 f(x)Y]- 1 Y t uz. 



Differentiating equation (J7|) gives the multiplier derivatives 



d_ 

dp 



-(UzU^-'Uz (d 2 f(x)^+uX 



(11) 



(12) 



The obvious advantage of using equation (fTTj) is that the matrix Y d 2 f(x)Y can be nonsingular when 
d 2 f(x) is singular. The computational cost of evaluating the right-hand sides of equations (fTT|) and (fT2|) 
is 0([n — \Z\] 3 ) + (9(|Z| 3 ). When n — \Z\ and \Z\ are small compared to n, this is an improvement over 
the cost 0(n 3 ) + 0(|Z| 3 ) of computing the right-hand side of equation (JHJ. Balanced against this gain is 



9 



the requirement of finding a basis of the null space of U z- Fortunately, the matrix Y is constant over each 
path segment and in practice can be computed by taking the QR decomposition of the active constraint 
matrix U z- At each kink of the solution path, either one constraint enters Z or o ne leaves. Therefore , 



Y can be sequentially compute d by standard updating and downdating formulas (jLawson and Hanson 



1987 



Nocedal and Wright 



2006). Which ODE © or (fTTj) is preferable depends on the specific application. 
When the loss function f(x) is not strictly convex, for example when the number of parameters exceeds 
the number of cases in regression , path following requires the ODE (ITT1) . Interested readers are referred 



to the book ( Nocedal and Wright 



2006) for a more extended discussion of range-space versus null-space 



optimization methods. 

For a general convex program, one can employ Euler's update 





-ApY 




~x(p)~ 


. d 


~x(p)~ 




-Ap) 




Mp) 


+ Ap— 

dp 


Hp) 


_u>(p- 


KAp) 




_o>0)_ 


_u{p)_ 



to advance the solution of the ODE ((SJ. Euler's formula may be inaccurate for Ap large. One can correct it 
by fixing p and performing one step of Newton's method to re-connect with the solution path. This amounts 
to replacing the position-multiplier vector by 



d x ,\,uk{x, X,u>,p) 1 k{x,\,LO,p). 



In practice, it is certainly easier and probably safer to rely on ODE packages such as the 0DE45 function in 
Matlab to advance the solution of the ODE. 

4 Examples of Path Following 

Our examples are intended to illuminate the mechanics of path following and showcase its versatility. As we 
emphasized in the introduction, we forgo comparisons with other methods. Comparisons depend heavily on 
programming details and problem choices, so a premature study might well be misleading. 

Example 4.1. Projection onto the Feasible Region 



Fi nding a f easibl e point is the initial stage in many convex programs. Dykstra's algorithm (jDvkstra 



1983 



Deutschl . 



20011) was designed precisely to solve the problem of projecting an exterior point onto the 
intersection of a finite number of closed convex sets. The projection problem also yields to our generic path 
following algorithm. Consider the toy example of projecting a point b £ M. 2 onto the intersection of the 



closed unit ball and the closed half space x\ > (jLange 

1„ 



2004 ). This is equivalent to solving 



minimize 



/(*) 



subject to hi(x) — ^IMI 2 — 2 ~ ^' h%i. x ) — —%i < 0. 



10 



The relevant gradients and second differentials are 



V/(x) = x-b, Vh 1 (x)=x, Vh 2 (x) 
d 2 f(x) = d 2 hi{x) = I 2 , d 2 h 2 {x) = Q. 



Path following starts from the unconstrained solution cc(0) = 6; the direction of movement is determined by 
formula ©. For x e {x : \\x\\ 2 > l,x± > 0}, the path 



dp 



-[(l + p)^]- 1 * 



1 



heads toward the origin. For x € {x : \x%\ > l,xi = 0}, the path 

/ i i „ n 1 \ — * 

dfx 
dp \ w 2 

also heads toward the origin. For x £ {x : \\x\\ 2 > l,x\ < 0}, the path 





1 + P 



X 2 





_i / Xi-1 



1 



dp LV ' r/ v 2:2 y i + p 

heads toward the point (1, 0)*. For x £ {x : ||cc|| 2 = 1, x\ < 0}, the path 



xi — 1 

X2 



dfx 
dp V w i 



1 + u>\ X\ 
1 + uji x 2 

X\ X2 




1+Wl 

X1X2 

l+wi 



is tangent to the circle. Finally, for a; e {a; : ||a;|| 2 < 1, x\ < 0}, the path 



d 



dp- ' z v 

heads toward the X2-axis. The left panel of Figure Q] plots the vector field -j^x at the time p — 0. The right 
panel shows the solution path for projection from the points (—2,0.5)*, (—2,1.5)*, (—1,2)*, (2,1.5)*, (2,0)*, 
(1, 2)*, and (—0.5, —2)* onto the feasible region. In projecting the point b = (—1, 2)* onto (0, 1)*, the ODE45 
solver of Matlab evaluates deri vatives at 19 different time points. Dykstra's algorithm by comparison takes 



about 30 iterations to converge (jLange 



2004). 



Example 4.2. Nonnegative Least Squares (NNLS) and Nonnegative Matrix Factorization (NNMF) 

Non-negative matrix factorization (NNMF) is an alternative to principle component analysis and is useful 
in mode ling, compressing, and interpreting nonnegativ e data such as observational counts and images. The 



articles (jBerrv et al 



2007 



Lee and Seung 



1999 



2001) discuss in detail estimation algorithms and statistical 



applications of NNMF. The basic idea is to approximate an m x n data matrix X = (xij) with nonnegative 
entries by a product VW of two low rank matrices V = (v^) and W = (wkj) with nonnegative entries. 



11 



2 
1.5 
1 

0.5 



-0.5 
-1 
-1.5 

-2 




' -» S _^^^\ \ \ N X ^ 

/ S / t I \ \ \ \ N V "V 
\ ! \ \ \ \ \ N V 

' / / / / \ \ \ \ \ WW 

'/ / / / \ \ \ \ \\ \\\ 



Figure 1: Projection to the positive half disk. Left: Derivatives at p = for projection onto the half disc. 
Right: Projection trajectories from various initial points. 

Here V and are m x r and r x n respectively, with r <C min{m, n}. One version of NNMF minimizes the 
criterion 



f(V,W) = \\X-VW\\ 2 F = J2T,( X ^-J2 V ^ 



(13) 



i j k 

where || • ||p denotes the Frobenius norm. In a typical imaging problem, m (number of images) might range 
from 10 3 to 10 4 , n (number of pixels per image) might surpass 10 4 , and a rank r = 50 approximation might 
adequately capture X. 

Minimization of the objective function fp~3|) is nontrivial because it is not j ointly convex in V and W . 



Multiple local minima are possible. The well-known multiplicative algorithm (|Lee and 



Seung 



1999 



Berry et al 



2001) 



2007) 



enjoys the descent property, but it is not guaranteed to converge to even a local minimum 
An alternative algorithm that exhibits better convergence is alternating least squares (ALS). In updating 
W with V fixed, ALS solves the n separated nonnegative least square (NLS) problems 



Vw 



J 112 



subject to Wj > 0, 



(14) 



where Xj and Wj denote the j-th columns of the corresponding matrices. Similarly, in updating V with W 
fixed, ALS solves m separated NNLS problems. The unconstrained solution W(0) = (V t Vy 1 V t X of W 
for fixed V requires just one QR decomposition of V or one Cholesky decomposition of V V. The exact 
path algorithm for solving the subproblem problem (fT4")l commences with W(0). If W(p) stabilizes with just 
a few zeros, then the path algorithm ends quickly and is extremely efficient. For a NNLS problem, the path 
is piecewise linear, and one can straightforwa rdly project the path to the next hitting or escape time using 



the sweep operator (jZhou and Lange 



2011a) . Figure [D shows a typical piecewise linear path for a problem 
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Figure 2: Piecewise linear paths of the regression coefficients for a NNLS problem with 50 predictors. 

with r — 50 predictors. Each projection to the next event requires 2r 2 flops. The number of path segments 
(events) roughly scales as the number of negative components in the unconstrained solution. 

Example 4.3. Quadratically Constrained Quadratic Programming (QC'QP) 

Example 14. II i s a special case of qua dratically constrained quadratic programming (QCQP). In convex 



QCQP (|Bovd and Vandenberghe . 



2004L Section 4.4), one minimizes a convex quadratic function over an 



intersection of ellipsoids and affine subspaces. Mathematically, this amounts to the problem 

minimize f(x) = —x t PoX + b l x + cq 
subject to gi(x) = a\x — d% = 0, i = 1, . . . , r 

hj(x) — -x f PjX + bjX + Cj < 0, j = l,...,s, 

where Pq is a positive definite matrix and the Pj are positive semidchnitc matrices. Our algorithm starts 
with the unconstrained minimum x(0) = —Pq 1 ^ and proceeds along the path determined by the derivative 



d_ 

dp 



x 



Uzix) 



u z (x) 




where U z{x) has rows a\ for i £ Zb and (PjX + bjY for j g Zi, and 

u z (x) = - a, ; + a, + (P 3 x + bj 

ieNs i£V E iEVi 



Affinc inequality constraints can be accommodated by setting one or more of the Pj equal to 0. 
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Figure 3: Trajectory of the exact penalty path algorithm for a QCQP problem (fT51) . The solid lines are the 
contours of the objective function f(x). The dashed lines are the contours of the constraint functions hj(x). 

As a numerical illustration, consider the bivariate problem 

1 

x\ + 

1\ 2 



minimize f{x) = ~x± + x 2 — x\x 2 + —x\ — 2x2 



subject to hi(x) — (xi ^ + x 2 — 1 < 

/ 1 \ 2 
h 2 (x) = (2:1+2) + x l- 1 <0 







(15) 



h 3 (x) 



1\ 2 

x 2 --) -1<0. 



Here the feasible region is given by the intersection of three disks with centers (0.5,0)*, (—0.5,0)', and 
(0, 0.5)', respectively, and a common radius of 1. Figure [3] displays the solution trajectory. Starting from the 
unconstrained minimum a;(0) = (1, 1.5)', it hits, slides along, and exits two circles before its journey ends 
at the constrained minimum (0.059,0.829)'. The 0DE45 solver of Matlab evaluates derivatives at 72 time 
points along the path. 

Example 4.4. Geometric Programming 



As a branch of convex optimizati on theory. 



quadratic programming in importance (|Bovd et al 



geometric programming stands just behind linear an d 



2007 



It ha s applications in chemi cal equilibri u m pro blems (|Passv and Wilde 



Eckci 



1980: 



Peressini et al 



1988; 



Peterson . 



1980), digit circuit design feovd et al 



19831 ). stochastic processes (jFeigin and Passv 



1968 ). structural mechanics (Eckcr 



2005), ma ximum likelihood estimation (Mazumdar and Jefferson 



1976) 



19811) . and a host of other subjects (jBovd et al 



2007 



Eckcr 
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1980). Geometric programming deals with posynomials, which are functions of the form 



In the left-hand definition of this equivalent pair of definitions, the index set S C M" is finite, and all co- 
efficients c a and all components Xi,...,x n of the argument a: of f{x) are positive. The possibly fractional 
powers cti corresponding to a particular a may be positive, negative, or zero. For instance, x^ 1 + 2xfx^~ 2 
is a posynomial on R 2 . In geometric programming, one minimizes a posynomial f(x) subject to posynomial 
inequality constraints of the form hj(x) < 1 for 1 < j < s. In some versions of geometric programming, 



equality constraints of monomial type are permitted (|Bovd et al 



2007). The right-hand definition in equa- 



tion (|16p invokes the exponential reparameterization Xi — e Vi . This simple transformation has the advantage 
of rendering a geometric program convex. In fact, any posynomial f(y) in the exponential parameterization 
is log-convex and therefore convex. The concise representations 



V/(y) = 5> a e at «a, d 2 f(y) = ]T c^ctc? 



: a c - ut' 

aes aes 
of the gradient and the second differential are helpful in both theory and computation. 

Without loss of generality, one can repose geometric programming as 

minimize ln/(y) 

subject to \ngi(y) = 0, 1 < i < r (17) 
\nhjiy) < 0, l<j<s, 

where f(y) and the hj(y) are posynomials and the equality constraints In gi(y) are affine. In this exponential 
parameterization setting, it is easy to state necessary and sufficient conditions for strict convexity and 
coerciveness. 

Proposition 4.5. The objective function f(y) in the geometric program |_?7| ) is strictly convex if and only 
if the subspace spanned by the vectors {a} a£ s is all ofW 1 ; f(y) is coercive if and only if the polar cone 
{z : z l a. < for all a G S} reduces to the origin 0. Equivalently, f(y) is coercive if the origin belongs to 
the interior of the convex hull of the set S. 



Proof. These claims are proved in detail in our paper (jZhou and Lange . 



2011a 



□ 



According to Propositions 12 . ll and !3 . 2\ the strict convexity and coerciveness of f(y) guarantee the unique- 
ness and continuity of the solution path in y. This in turn implies the uniqueness and continuity of the 
solution path in the original parameter vector x. The path directions are related by the chain rule 

T^AP) = -i--rVAP) = Xi-rVAP)- 
dp dyi dp dp 
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Figure 4: Trajectory of the exact penalty path algorithm for the geometric programming problem (j!8p . The 
solid lines are the contours of the objective function f(x). The dashed lines are the contours of the constraint 
function h(x) at levels 1, 1.25, and 1.5. 

As a concrete example, consider the problem 

minimize a;j~ 3 + 3x^X2 + x\xi (18) 
1 2 

subject to a x ^ 2 o X2 — 1' Xl > ^' X2 ^ ^' 

It is easy to check that the vectors {(—3,0)*, (—1, —2)*, (1, 1)*} span M. 2 and generate a convex hull strictly 
containing the origin 0. Therefore, f(y) is strictly convex and coercive. It achieves its unconstrained 
minimum at the point x(0) — (v6> v6)*, or equivalently y(0) = (In 6/5, ln6/5)*. To solve the constrained 
minimization problem, we follow the path dictated by the revised geometric program (117[) . Figure [4] plots 
the trajectory from the unconstrained solution to the constrained solution in the original x variables. The 
solid lines in the figure represent the contours of the objective function f(x), and the dashed lines represent 
the contours of the constraint function h(x). The DDE45 solver of Matlab evaluates derivatives at seven 
time points along the path. 

Example 4.6. Semidefinite Programming (SDP) 



The linear semidefinite programming problem (jVandenberghe and Bovd . 



19961 ) consists in minimizing 



the trace function X H> tr(CX) over the cone of positive semidefinite matrices 5? subject to the linear 
constraints ti(AiX) — hi for 1 < i < p. Here C and the Ai are assumed symmetric. According to Sylvester's 
criterion, the constraint X 6 S 1 ™ involves a complicated system of inequalities involving nonconvex functions. 
One way of cutting through this morass is to focus on the minimum eigenvalue v\(X) of X. Because the 
function —i>\(X) is convex, one can enforce positive semidcfinitcness by requiring —i>\{X) < 0. Thus, the 
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linear semidefinite programming problem is a convex program in the standard functional form. 

It simplifies matters enormously to assume that v\(X) has multiplicity 1. Let u be the unique, up to 
sign, unit eigenvector corresponding to i>i(X). The matrix X is parameterized by the entries of its lower 
triangle. With these conventions, the following formulas 

-ikr {x) = - ut ^ Xu (19) 
-d^r {x) = - ut ^ x{viI - xr ^ Xu 

- Ut ^- X{ " lI - Xr ^ XU 

= -2« f - — X{v x I-X)-- — Xu (20) 

OXij OXkl 



1999). Here 



for the first and second partial derivatives of — Vx(X) are well known (jMagnus and Neudecker . 
the matrix {v\I — X)~ is the Moore-Penrose inverse of V\I — X . The partial derivative of X with respect 
to its lower triangular entry xij equals Eij + lu^jxEji, where Eij is the matrix consisting of all 0's excepts 
for a 1 in position Note that u l Eij = i^e'- and E^u = uie^ for the standard unit vectors ej and e^. 

The second partial derivatives of X vanish. The Moore-Penrose inverse is most easily expressed in terms of 
the spectral decomposition of X . If we denote the «th eigenvalue of X by Vi and the corresponding ith unit 
eigenvector by Ui, then we have 



-Uiu\. 

, Vi - 



{x- Vl iy = 

Finally, the formulas 

tT{AiX)-bi = y^(Ai)fc fc Xfcfc + 2y^y^(Ai) fc ;Xfc; - h 

k k Kk 

d 

-[tr(AiX) - h] = (Ai) k i + l {k ^ l} (Ai)ik 



dx k i 

express the linear constraints and their partial derivatives in terms of the lower triangular entries of X . 

Initiating path following is problematic because tr(C-X') has minimum — oo. A good strategy is to amend 
the surrogate function £ p {x) by adding the term where e(p) is a smooth positive function that 

decreases to 0. Taking e(p) = e~ cp for c positive works well in practice. The new surrogate function 
tr(C.X') + ||p is strictly convex and possesses a unique minimum for all p > 0. In view of the 

identities = £. ^ 4 and tr(CX) = ^ Ej 

CijXij for X — (xij) and C — (cy), the initial condition 

X (0) = — e(0) _1 C is straightforward to deduce. 

Path following must be m odified to accommodate the new surrogate function. In the notation of 



(jMagnus and Neudecker 



1999|), let x = v(X) be the \n(n + 1) vector obtained from vec(X) by eliminating 
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Figure 5: Solution path of a semidefinite programming example. 

all supradiagonal entries, and let D be the n 2 x \n(n + 1) duplication matrix satisfying vec(X) = Dx. 
Applying the chain rule to the obvious identities ||-X"||| = xD l Dx and tr(CX) = vec(CYDx, one can 
extend the derivation of Proposition 13.31 and prove that 



d_ 

dp 



x 



e(p)D t D-u} Zx d 2 i' 1 (x)l {vi{x}=0} U* z 
U z 

de (p) n* 



D l Dx - E ieA A E £> t vec(A i ) + J2i£-p E £>'vec(A t ) - V^(a;)l {^i(X)<0} 

V 

Path following proceeds until all constraints are satisfied and e(p) is negligible. 

For didactic purposes, considering the problem of minimizing tr(CX) subject to 



tr(AiX) 



1, tr(A 2 X) = 2, and X e S[ 



where 

C=q l), A 1= (J °J, and A 2 = (° Q J 

Figure [5] displays the solution paths of the entries Xij of X and the minimum eigenvalue v\ . Here we use 
e(p) = e~ p . The path starts with X(0) — —C, hits, slides along, and exits various constraints, and ends at 
the constrained solution 

Example 4.7. Image Denoising 

Image analysis is another fertile field for path following. Here we explore how to restore or enhance images 
by removing noise. This example differs from previous examples in that the fully constrained solution is 
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trivial. The solution path itself is the object of interest. Suppose that w = (my) € jj" 1 *™ represents the 
recorded gray levels across a 2D array of pixels from a no isy image with tru e gray levels u = (uij). The 



well-known denoising model of Rudin-Osher-Fatemi (ROF) (jRudin et al 
regularized least squares criterion 

1 



19921) minimizes the total variation 



w-u\\l + pTV(u) 



(21) 



The total variation penalty serves to smooth the reconstructed image and preserve its edges. A similar effect 
can be achieved by replacing the isotropic penalty TV(tt) by the anisotropic penalty 



TVi(it) = ( K+i,j - Ujj \ + 



(22) 



In this example we focus on path following for the anisotropic penalty and a more general convex loss 
function f(u). The objective function is now 



/(«) + p||£Hi|| 



(23) 



For instance, the amended loss function /(it) = ^\\w — Ku\\2 with a Gaussian or motion blurring matrix K 
is appropriate in many im aging problem s. Poisson count data are relevant to image reconst ruction in X-ra y 



and positron tomography 



Lange 



20101 ) and to image denoising in certain circumstances (|Le et al 



2007) 



With Poisson noise, the least squares criterion is replaced by a negative loglikelihood. The difference matrix 
D captures the l\ penalty (|22[) . Note that the matrices w and u are now viewed as vectors. For an m x n 
2D image, the difference matrix D has Iran — m — n rows (penalties) and mn columns (pixels). This matrix 
is very sparse, with just 2(2mn — m — n) nonzero entries equal to ±1. When m and n are both at least 2, 
D has more rows than columns and a reduced column rank of mn — 1 . 

For sufficiently large p, the minimum of the objective functions ([21]) reduces to a constant vector (blank 
image) equal to the average value w of the my. The goal of image denoising is to find a p such that the 
recovered image is judged satisfactory by visual inspection or other more qu antitative criteria . Notable 



computational adv ances in solving this problem include Chambolle's algorithm (jChambolle 



Bregman iteration (jGoldstein and Osher 



2004f) and split 



20091) . These methods minimize the objective functions (|2lj) and 
(|23| for a fixed value of p. The web site of UCLA's Computational and Applied Math Group summarizes 
the most recent progress in this area. In reality, outer iterations are almost always required to tune the 
parameter p. Path following is an attractive option because it provides the whole solution path at about the 
same computational cost as recovering the solution for an individual p. 
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Although it is tempting to minimize the criterion (|23p by path following, the regularization matrix D 
has linearly dependent rows and deficient rank. Because the assumptions of Proposition 13.21 are violated, 
the multipliers Ae of the active constraints in equations ([7} and ^ are not uniquely determined. One 
can intuitively understand the difficulty by considering a square with four pixels. Whenever any three 
constraints are active, the fourth is automatically active as well. This constraint redundancy can be remedied 
by reparameterizing the model in terms of neighboring pixel differences x = Du. Unfortunately, the rank 
deficiency of D is also an issue. Adding the same constant to all of the components of u yields exactly the 
same x. To circumvent this problem, we simply append a bottom row to D with all entries except for a 1 
in the last position. If V is the amended version of D, then V has full column rank, and the vector x = Vu 
uniquely determines the image. Indeed, one can solve for x in the form u = (VV) _1 V'a;. The bottom 
entry of x is obviously the gray level of the last pixel of the image. 

Despite the presence of the inverse of the huge mnxmn matrix V* V, the transformation u = (V t V)~ 1 V t x 
is not as daunting as it appears. First of all, multiplication by the sparse matrix V is trivial. More impor- 
tantly, the matrix V V is symmetric, banded, and extremely sparse. To count its nonzero entries, note that 
except for diagonal entries, these entries occur in the same positions as the nonzero entries of the adjacency 
matrix of a corresponding graph with 2mn — m — n edges and mn nodes. Because an adjacency matrix has 
twice as many nonzero entries as edges, the matrix V 1 V has at most 2(2mn — m — n) + ran = 5mn — 2m — 2n 
nonzero entries. These occur within a band of width min{m, n} along the main diagonal, depending on 
whether we stack columns or concatenate rows. The most convenient way to solve equations of the kind 
V t Va = b is to extract the Cholesky decomposition L of V t V and execute forward and backward substi- 
tution. Although extraction of L is cheap for banded matrices, it is even cheaper for banded matrices with 
just a handful of nonzero entries per row. In our experience, the computational complexity of extracting L 
scales linearly in the product mn. Since L itself is sparse, forward and backward substitution are also very 
cheap. For instance with a 256 x 256 image, M ATLAB computes L (a 65536 x 65536 matrix) in 0.26 seconds 
on a laptop; L contains just 1,971,395 nonzero entries. The sparsity of L suggests that it be computed once 
and stored in compressed format for all images of a given size. Many of its nonzero entries are close to 
zero. Thus, a fairly light truncation of the non-diagonal entries of L gives an even sparser matrix realizing 
nearly the same transformation. Figure [5] displays the sparsity pattern of the matrix V V and its permuted 
Cholesky factor L for 64 x 64 images. Images of other sizes show similar sparsity patterns. 

The problem of minimizing the objective function h\\w — Ku\\2 + pH-DitHi in the transformed variable 
x turns out to co i ncide with lasso pen alized regression, for which an efficient path algorithm is known 



(Efron et al 



2004; 



Osborne et al 



20001 ) . Let us sketch how path following works in the more general case. 
The objective function is f(Bx) + p\\x-\\\, where B = (V t V)~ 1 V t and X- denotes the vector x with 
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Figure 6: Sparsity patterns of V t V and its Cholesky decomposition L for 64-by-64 images. 

its last entry deleted. The penalty contributions correspond to affine equality constraints in constrained 
minimization. In path following, the penalty constant p starts large and moves downward. The initial image 
is flat with gray level determined by taking a;_ = and adjusting the last entry of x to minimize f(Bx). 
Call this point x^. The first escape time occurs at p max — maxj \(B V^-BcCoo)^. At this juncture path 
following begins in earnest. Under the x parameterization, the loss function has gradient B*V f(Bx) and 
second differential B t d 2 f(Bx)B. Because f{Bx) is not strictly convex, our previous reparameterization 
from x to y variables is needed. Based on equation (fTl) . the path ODEs reduce to 

— x z = -(B%d 2 f(Bx)B 2 )- 1 sgn(xz), = 0, 

^-\ z = -B t z d 2 f{Bx)B z ^-x 2 . (24) 

Observe that the updates of equation ([TT]) drastically simplify because the rows of the active constraint 
matrix U z & n d the columns of its null space matrix Y are populated by standard Euclidean unit vectors. 
Furthermore, for the ROF model of image denoising, d 2 f(Bx) is a diagonal matrix. Alternatively, one can 
derive the ODE equations (1241) from first principles by implicitly differentiating the stationary conditions. 
Path following solves the coupled ODEs (j2"4")l segment by segment. 

For a quadratic loss function, the second differential is constant, and the solution path is piecewise linear. 
Thus no ODE solving is involved. With a blurring matrix K, the second differential is B l d 2 f(Bx)B — 
B t K t KB. After each path extension, the path directions ([24)) yield the next event time pj at which a 
nonzero component Xj hits zero, or a multiplier Xj of a zero component Xj hits p or —p. The path is then 
extended to the closest of these event times. In deblurring or denoising, the inverse of BzK'KBz is best 
computed via a QR decomposition oi B Z K. At each kink in the path, B Z K changes by adding or deleting a 
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Figure 7: A noisy image and snapshots along the regularization path. 



column of BK. As we men tioned earlier, it is straightforward to update or downdate the QR decomposition 



(Lawson and Hanson 



19871 ). In the original ROF model, traversing one time segment requires about 0(p) 
operations for p = mn total pixels. The whole process ends when T differences Xj becomes nonzero. In 
practice, a large value of T recovers too grainy an image, so T is typically much smaller than p. The total 
cost of computing the solution path is approximately 0(Tp), which is comparable to the cost of start-of-art 
algorithms for minimization at a single p. 

Figure [7] illustrates denoising of a 112 x 91 image of a lighthouse. The corrupted image appears in the 
top-left corner of the figure. The p — 10, 192 pixels generate 20, 182 transformed variables. It takes our 
Matlab script about one minute of desktop computing time to traverse T = 2, 500 segments along the 
regularization path from p — 87.9881 (blank image) to p — 0.5206 (a nearly optimal image). In the process, 
the lighthouse clearly emerges from the fog of oversmoothing. Figure [7] displays selected snapshots along the 
regularization path. We emphasize that path following based on equation (|2"5|) reveals the entire path for 
the interval [0.5206,87.9881] of p values. In practice, one can accelerate path following by starting from a p 
nearer to the ultimate destination. 
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5 Discussion 



Our path following algorithm for constraine d convex optimization build s on but differs from the tradi 



tion of path following in homo topy methods (jZangwill and Garcia 



19811 ) and interior point programming 



Bovd and Vandenberghd (|2004f ) . The paths encountered in the exact penalty method introduce the novelty of 



piecewise differentiability, which can be effectively handled by tracki ng the Langrange multipliers. Computa- 



tional statisticians deserve credit for explorin g this difficult terrain (jEfron et al 



Zhou and Lange 


2011b 


Zhou and Wu, 


2011) 



2004; 



Osborne et al 



2000: 



exact penalty methods. 

Our algorithm enjoys the dual advantages of simplicity and generality. Given the rich numerical resources 
of Matlab, it is straightforward to solve the required ODEs segment by segment. Regardless of whether 
path following is faster or slower than existing optimization methods, it supplies the whole solution path. 
In regularized estimation, this level of detail offers unprecedented insight into how penalties and predictors 
interact. Our example on image denoising is a case in point. 

In quadratic programming w ith affine equality and inequality constraints, the solution path is piecewise 



linear (jZhou and Lange , 



2011bl ). This permits path following to take large steps. Furthermore, each step 
can be implemented very efficiently by the sweep operator of computational statistics. Despite the loss 
of these advantages in more complicated examples, the real culprit in path-following deceleration in many 
applications is an excessive number of constraints to be navigated. Our image denoising example suffers 
from this defect. On the positive side of the ledger, in nonconvex problems path fo llowing may well prove to 



2010). 



be more reliable than competing methods in separating global from local minima (jZhou and Lange , 

Various extensions of path following are in order. First, the current algorithm commences from the 
unconstrained solution. Our development relies on the strict convexity and coerciveness of the objective 
function to ensure a unique starting point. In principle, path initiation should work for any problem with 
a unique unconstrained minimum. Similarly, path continuation should be possible whenever the interior 
solution is well defined and piecewise smooth. As the image denoising example suggests, reparametrization 
can play an important role in correcting defects in strict convexity. Another possibility is to amend the 
surrogate function £ p (x). In our semidefinite programming example, we add the term e _cp ||X||| to enforce 
strict convexity and coerciveness. A similar tactic obviously works in other examples. 

A second generalization is to expand the list of penalty functions. For instance, Euclidean penalties of 
the form || Mx + a.|| 2 are useful in grouping parameters in statistical problems. It should be straightforward 
to extend path following to include such penalties. A third generalization is to remove convexity restrictions 
altogether. As we have noted, the exact penalty method applies equally to nonconvex programming. Path 
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following in this setting is nontrivial since the solution path is no longer necessarily continuous. This poses 
a real challenge, and it is unclear to us whether one can construct a theory as satisfying as that standing 
behind modern interior point methods. We invite the optimization community to tackle this broader issue. 
In the meantime, we are happy to share our Matlab code with interested researchers. 
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